Estimating and Exploiting Language Distributions of Unlabeled Data

Author

  • Alan McCree
Abstract

This paper addresses the problem of language distribution estimation from unlabeled data. We present a new algorithm that treats automated classifier identification outputs as likelihoods and iteratively applies Bayes’ rule to reclassify the data using successively improving distribution estimates as “priors”. Experimental results using the MIT LL submission to the NIST LRE07 evaluation show significant improvements in estimation of nonuniform distributions as compared to a baseline counting approach. In addition, we show how to incorporate these estimated distributions into the classification task. Further experiments on the LRE07 corpus show large gains for both the detection/verification and identification tasks when only a small set of languages are actually present in the test set.
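The iterative re-estimation described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the classifier outputs are available as a matrix of per-utterance likelihoods (rows are utterances, columns are languages), that every row has at least one nonzero entry, and that the "baseline counting approach" means counting hard argmax decisions. The function names `estimate_priors`, `counting_baseline`, and `classify_with_priors` are hypothetical.

```python
import numpy as np

def estimate_priors(likelihoods, n_iter=100, tol=1e-8):
    """Iteratively re-estimate the class distribution from per-sample likelihoods.

    Assumption: likelihoods[i, k] is proportional to p(x_i | language k).
    """
    lik = np.asarray(likelihoods, dtype=float)
    n_classes = lik.shape[1]
    priors = np.full(n_classes, 1.0 / n_classes)  # start from a uniform prior
    for _ in range(n_iter):
        joint = lik * priors                              # Bayes' rule numerator
        post = joint / joint.sum(axis=1, keepdims=True)   # per-sample posteriors
        new_priors = post.mean(axis=0)                    # averaged posteriors become the next "prior"
        converged = np.abs(new_priors - priors).max() < tol
        priors = new_priors
        if converged:
            break
    return priors

def counting_baseline(likelihoods):
    """Assumed baseline: fraction of samples whose top-scoring class is k."""
    lik = np.asarray(likelihoods, dtype=float)
    counts = np.bincount(lik.argmax(axis=1), minlength=lik.shape[1])
    return counts / counts.sum()

def classify_with_priors(likelihoods, priors):
    """Identification using the estimated distribution as the prior."""
    return (np.asarray(likelihoods, dtype=float) * priors).argmax(axis=1)
```

In this sketch, a language that is absent from the test set tends to receive a prior near zero after a few iterations, which is one way the estimated distribution could help when only a few languages are actually present.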

Similar resources

Discovery of Informative Unlabeled Data for Improved Learning

In computer vision, the acquisition of sufficient labeled data for training is often time-consuming. However, unlabeled data are conveniently available. The key problem is to discover and incorporate those informative and confidently predicted unlabeled data into the training set for improved learning. In this paper, we discover such unlabeled data by exploiting the locality property of the dat...

Agreement/Disagreement Classification: Exploiting Unlabeled Data using Contrast Classifiers

Several semi-supervised learning methods have been proposed to leverage unlabeled data, but imbalanced class distributions in the data set can hurt the performance of most algorithms. In this paper, we adapt the new approach of contrast classifiers for semi-supervised learning. This enables us to exploit large amounts of unlabeled data with a skewed distribution. In experiments on a speech act ...

Exploiting Unlabeled Data Using Improved Natural Langua

This paper presents an unsupervised method that uses a limited amount of labeled data and a large pool of unlabeled data to improve natural language call routing performance. The method uses multiple classifiers to select a subset of the unlabeled data to augment the limited labeled data. We evaluated four widely used text classification algorithms: Naive Bayes Classification (NBC), Support Vector ma...

Estimating the class prior and posterior from noisy positives and unlabeled data

We develop a classification algorithm for estimating posterior distributions from positive-unlabeled data, that is robust to noise in the positive labels and effective for high-dimensional data. In recent years, several algorithms have been proposed to learn from positive-unlabeled data; however, many of these contributions remain theoretical, performing poorly on real high-dimensional data tha...

Semi-supervised Relation Extraction using EM Algorithm

Relation extraction is the task of identifying relations between entities in a natural language sentence. We propose a semi-supervised approach for relation extraction based on the EM algorithm, which uses a few relation-labeled seed examples and a large number of unlabeled examples (but labeled with entities). We present an analysis of how unlabeled data helps in improving the overall accuracy compared t...

Publication date: 2010